{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example #2 - Working with files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make sure our data files are there. The `os` library will be very helpful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from os import listdir\n",
    "from os.path import join\n",
    "\n",
    "listdir('.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "listdir('../data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will rely on `pandas`'s built in `read_csv` function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "pd.read_csv?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What other \"read\" functions does `pandas` have?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "regex = re.compile(r'read')\n",
    "list(filter(regex.match, dir(pd)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's focus on `read_csv` for now, and investigate some data from [openflights.org](openflights.org)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "routes = pd.read_csv(join('..', 'data', 'routes.csv'))\n",
    "airports = pd.read_csv(join('..', 'data', 'airports.csv'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "routes.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "routes.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like there's some missing data points...can we visualize what that looks like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import missingno as msno"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`missingno` is a library that wraps together handy data utilities to get a visual summary of data completeness. We will only scratch the surface of this library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "msno.matrix(routes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives a quick display to visually check out your data. The sparkline on the right shows the completeness of the data. Let's check out airports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "airports.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "msno.matrix(airports)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cool! Overall the csvs look pretty clean, with relatively complete data. We've got ~68000 airline routes and ~7000 airports around the world. Let's see if this list is real by trying to find everyone's favorite airport - LaGuardia (LGA)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "airports[airports['IATA'] == 'LGA']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome. How about finding some other things like the highest airport in the world, highest airport in the country that comes last alphabetically, and the highest airport in Australia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "highest = airports.sort_values('altitude', ascending=False).iloc[0]\n",
    "highestInLastAlpha = airports.sort_values(['country', 'altitude'], ascending=[False, False]).iloc[0]\n",
    "highestInAustralia = airports[airports['country'] == 'Australia'].sort_values('altitude', ascending=False).iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f'Highest airport: \\n{highest}\\n\\n')\n",
    "print(f'Highest airport in the country that comes last alphabetically: \\n{highestInLastAlpha}\\n\\n')\n",
    "print(f'Highest airport in Australia: \\n{highestInAustralia}\\n\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some more practical things with this data like finding the 10 busiest routes in the world."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "busiest = routes.groupby(['source', 'dest']).count()['airline_id'].nlargest(10)\n",
    "busiest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can merge in the names of the airports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "busiest = pd.DataFrame(busiest).reset_index()\n",
    "busiest.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for sd in ['source', 'dest']:\n",
    "    busiest = busiest.merge(airports[['IATA', 'name']].set_index('IATA'), how='left', left_on=sd, right_on='IATA')\n",
    "\n",
    "busiest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
